Tag: data engineering

2026-05-23 • Alex Merced

An In-Depth Overview of the Apache Iceberg 1.11.0 Release

Apache Iceberg 1.11.0 delivers manifest list encryption, the new pluggable File Format API, credential lifecycle refresh...

2026-05-23 • Alex Merced

Single-Node Data Engineering: DuckDB, DataFusion, Polars, and LakeSail

Optimize single-node data engineering with DuckDB, DataFusion, Polars, and LakeSail. Compare architectures and learn whe...

2026-04-29 • Alex Merced

What Are Table Formats and Why Were They Needed?

Table formats like Apache Iceberg solved the ACID, schema, and performance problems that turned data lakes into data swa...

2026-04-29 • Alex Merced

The Metadata Structure of Modern Table Formats

Iceberg uses a metadata tree, Delta Lake uses a transaction log, Hudi uses a timeline. Here is exactly how each format o...

2026-04-29 • Alex Merced

Performance and Apache Iceberg's Metadata

Iceberg's three-layer metadata tree eliminates directory listing and enables multi-level data skipping. Here is how scan...

2026-04-29 • Alex Merced

Partition Evolution: Change Your Partitioning Without Rewriting Data

Iceberg lets you change partition schemes without rewriting data. Here is how partition evolution works internally and w...

2026-04-29 • Alex Merced

Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans

Iceberg's hidden partitioning separates physical layout from user queries using transform functions. Here is how it work...

2026-04-29 • Alex Merced

Writing to an Apache Iceberg Table: How Commits and ACID Actually Work

Here is exactly how an engine writes to an Iceberg table, step by step, from data files through the atomic commit that m...

2026-04-29 • Alex Merced

What Are Lakehouse Catalogs? The Role of Catalogs in Apache Iceberg

Lakehouse catalogs store metadata pointers, manage namespaces, and enforce access control. Here is the complete catalog ...

2026-04-29 • Alex Merced

When Catalogs Are Embedded in Storage

S3 Tables and MinIO AIStor embed the Iceberg catalog directly in the storage layer. Here is when embedded catalogs make...

2026-04-29 • Alex Merced

How Data Lake Table Storage Degrades Over Time

Iceberg tables degrade through small files, orphan files, metadata bloat, sort order decay, and partition skew. Here is ...

2026-04-29 • Alex Merced

Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup

Keep Iceberg tables fast with compaction, snapshot expiry, orphan cleanup, and manifest rewriting. Here is when and how ...

2026-04-29 • Alex Merced

Apache Iceberg Metadata Tables: Querying the Internals

Iceberg metadata tables let you query snapshots, files, manifests, and partitions using SQL. Here is every metadata tabl...

2026-04-29 • Alex Merced

Using Apache Iceberg with Python and MPP Query Engines

Access Iceberg tables from Python with PyIceberg, DuckDB, and Polars, or through MPP engines like Dremio, Spark, and Tri...

2026-04-29 • Alex Merced

Approaches to Streaming Data into Apache Iceberg Tables

Stream data into Iceberg with Spark Structured Streaming, Flink, or Kafka Connect. Here is how each works and the trade-...

2026-04-29 • Alex Merced

Hands-On with Apache Iceberg Using Dremio Cloud

A practical walkthrough of creating, querying, and optimizing Iceberg tables on Dremio Cloud, from account setup to AI-p...

2026-04-29 • Alex Merced

Migrating to Apache Iceberg: Strategies for Every Source System

Migrate to Iceberg from Hive, data warehouses, or raw files using in-place migration, full rewrite, or the zero-downtime...

2026-02-19 • Alex Merced

How to Think Like a Data Engineer

The median lifespan of a popular data tool is about three years. The tool you master today may be deprecated or replaced...

2026-02-19 • Alex Merced

How to Design Reliable Data Pipelines

Most pipeline failures aren't caused by bad code. They're caused by no architecture. A script that reads from an API, tr...

2026-02-19 • Alex Merced

Data Quality Is a Pipeline Problem, Not a Dashboard Problem

When an analyst finds null values in a revenue column, the typical response is to add a calculated field in the BI tool:...